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^! Abstract 

| We demonstrate and explicate Bayesian methods for fitting the parameters that encode the 

impact of short-distance physics on observables in effective field theories (EFTs). We use Bayes' 
theorem together with the principle of maximum entropy to account for the prior information that 
\& ■ these parameters should be natural, i.e. 0(1) in appropriate units. Marginalization can then be 

employed to integrate the resulting probability density function (pdf) over the EFT parameters 
that are not of specific interest in the fit. We also explore marginalization over the order of the 
EFT calculation, M, and over the variable, R, that encodes the inherent ambiguity in the notion 
that these parameters are 0(1). This results in a very general formula for the pdf of the EFT 
parameters of interest given a data set, D. We use this formula and the simpler "augmented 
X 2 " in a toy problem for which we generate pseudo-data. These Bayesian methods, when used in 
combination with the "naturalness prior" , facilitate reliable extractions of EFT parameters in cases 
where x 2 methods are ambiguous at best. We also examine the problem of extracting the nucleon 
mass in the chiral limit, Mq, and the nucleon sigma term, from pseudo-data on the nucleon mass 
as a function of the pion mass. We find that Bayesian techniques can provide reliable information 
on Mq , even if some of the data points used for the extraction lie outside the region of applicability 
of the EFT. 
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I. INTRODUCTION 



Effective field theory (EFT) methods allow the treatment of problems in which there 
is a separation of scales. In these theories dynamics at the low-energy scale, m, say, is 
incorporated explicitly in the theory, while the degrees of freedom that enter the problem 
at the high-energy scale, A, are integrated out. (See, Refs. [l|, 0, 0, 0] for pedagogical 
introductions to EFT.) The impact of modes with p ~ A on dynamics for p ~ m is then 
accounted for via a sequence of contact operators of increasing dimension. If there is no pre- 
determination as to which operators appear in this sequence then the theory is free of model 
assumptions about the high-energy dynamics. Therefore, in general, all contact operators 
consistent with the symmetries that are applicable at the scale p ~ m should be included in 
the EFT expansion. The coefficients of these admissible contact operators encode the impact 
of high-energy physics on low-energy observables in a systematic and model-independent 
way. Observables corresponding to momenta p ~ m can be computed as an expansion in 
powers of m/A, and the resultant formulae are model-independent predictions, depending 
only on the existence of the scale separation and the symmetries of the low-energy theory. 

One popular application of EFT is to low-energy QCD. In this case the scale separation 
is between the mass of the pion, the pseudo-Goldstone boson of QCD's spontaneously- 
broken (approximate) chiral symmetry, and the masses of other hadronic degrees of freedom. 
The EFT which incorporates chiral symmetry and encodes this scale separation is known 
as chiral perturbation theory (%PT) @, @, 0, 0, E3, HI- The xPT expansion for a 
hadronic observable is then an expansion in powers of m/A, with loop diagrams introducing 
non- analytic dependence on this expansion parameter. 1 The dynamics at scale A impacts 
this expansion through certain coefficients which are not determined a priori. The low- 
energy symmetries of QCD mandate that once determined in one process these parameters — 
the "low-energy constants" (LECs) of %PT — will appear in other processes too, thereby 
giving xPT predictive power once the LECs at a given order are known. 2 There are some 
instances in which an LEC can be rigorously computed from the underlying theory, but 
lattice calculations which do this for low-energy QCD exist in only a very few cases. In this 
situation the only model-independent way to find the LECs is to fit them to experimental 
data. Such parameter estimation is thus a crucial component of %PT, and indeed of all EFT 
programs. 

The standard method of determining LECs from data is to perform a fit using the EFT 
expansion of that physical quantity at a fixed order, employing techniques such as least 
squares or maximum likelihood. But here we face several dilemmas as regards the "best" 
way to obtain the LECs, including: 



1 Here we are concentrating on an observable at a particular kinematic point. can, of course, also be 
used to compute the low-energy dependence of observables on £/A, where £ is any kinematic parameter 
with dimensions of mass. 

2 Note that here and throughout we are using the term "low-energy constants" to refer to the coefficients 
in the EFT expansion of a physical observable. This is different to the oft-employed meaning of "LEC" 
as the coefficient of an operator in the EFT Lagrangian. We have chosen not to adopt that meaning here 
since the values of those coefficients depend on the conventions used in the Lagrangian: the interpolating 
fields, set of independent operators written down at a given order, etc. In contrast the coefficients in the 
EFT expansion of S-matrix elements are independent of all such choices. 
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1. Which data should be used to determine the LEC? More data is available as the 
maximum energy of the data set is increased, but the reliability of a fixed order EFT 
calculation decreases as the energy is increased. 

2. What order of EFT calculation should be used to extract the LEC? The first one 
at which that LEC appears, or the highest one to which the expansion has been 
computed? 

3. How should prior constraints on LECs (e.g. from the requirement of "naturalness" 
with respect to the scale A, or from other processes) be incorporated into the fit? 

In an ideal situation none of these dilemmas matter, and all fitting paths lead to the same 
LEC (within errors). But if only somewhat imprecise experimental data is available in the 
region of validity of the EFT then the extracted LEC can be significantly sensitive to the 
manner in which the fit is done. 



In this paper we argue that Bayesian methods (see e.g. Refs. [12j, [13J]) are ideal for 
parameter estimation of LECs in EFTs, and that they resolve all the above dilemmas. In 
the Bayesian approach the central object is the posterior probability distribution function 
(pdf) for the LECs of interest, say ao and cti, and we want their joint, conditional distribution 
given a data set D: pr(ao, cli\D). Bayes' theorem gives us the following relation between 
this and the more-usually computed pr(D|a , ai): 

i m pr(D|a ,ai)pr(a ,ai) 

pr{a ,a 1 D) = — . 1 

pr(L>) 

Here the first factor on the right-hand side is the "likelihood" that is minimized in a x 2 01 
least-squares approach. It is through the second factor that prior information can be incor- 
porated in the fit. (The factor in the denominator may be determined by the requirement 
of a normalized pdf for pr(ao, ai\D).) 

The fact that EFTs intrinsically depend on scale separation means that in an EFT fit 
there is information available on the size of LECs prior to the analysis of the data. In 
a standard, perturbative EFT with one high-energy scale the LECs should be "natural" 
with respect to the scale A, i.e. 0(1) when measured in units of A. 3 Consequently we 
begin by encoding the fact that the LECs ao, . . . , ayi should be natural through the prior 
pr(a). We choose a M + 1-dimensional Gaussian prior, of width R, since that is the least 



informed prior if we know the expectation value of Y2i a i UM ■ This yields a version of the 



"constrained curve fitting" method recently advocated for lattice QCD data by Lepage [20 



Morningstar [2J| and others (see e.g. Refs. [22|, |23j). 

Constrained curve fitting therefore amounts to the computation of a posterior pdf with 
a Gaussian prior at fixed order M. We can eliminate sensitivity to the precise value of 
R by averaging the resulting pdf over a range of R values, thereby incorporating in our 
result the inherent ambiguity in the notion of "(9(1)" LECs. "Marginalizing" in this way 
over unwanted parameters is a technique used to obtain posterior pdfs that incorporate the 
uncertainty that results from systematic differences in the parameters which are obtained 
when fits are done in different ways. Therefore we also marginalize over the order M of the 



The situation can be complicated by the presence of additional scales [3, H], and/or infra-red fixed 



points 



1G 



17 



18J but the underlying fact of prior information on the LECs remains true. 
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EFT calculation. This necessitates additional marginalization over the LECs that appear 
at orders M > 1 and so amount to "nuisance parameters" in our effort to extract ao and a%. 
Our final formula for the posterior pdf pr(ao, a\\D), Eq. (1351) . therefore involves a sum over 
M, as well as integrals over R and 0*2, ■ ■ ■■ The resulting central values and uncertainties 
for ao and a\ incorporate a rigorous accounting of the theoretical uncertainty present in a 
low-order EFT fit. 

In Section [III we review Bayesian methods as well as the standard maximum-likelihood 
technique, derive the formula for the "augmented x 2 " — a X 2 which penalizes unnatural 
values of the fit parameters — and derive our final formula — see Eq. (135]) — for pr(ao, ai\D). 
In Sec. IIHI we apply both the augmented \ 2 and the formula fl35|) to the "toy" problem 
of extracting the coefficients of a power-series approximant to the function g[x) = (| + 
tan(7r/2x)) 2 from pseudo-data that is statistically distributed around the curve g. We show 
that these two Bayesian methods provide an extraction which does not depend on the interval 
over which the fit is performed. We also show how they avoid certain ambiguities that are 
present in ^-minimization. And we find that reliable information on power-series coefficients 
can be gleaned from the data even in cases where standard techniques are powerless. 

In Sec. IIVI we generate pseudo-data with small errors from the xPT function for the 
nucleon mass as a function of the pion mass. We show that methods based on Eq. (|35|) 
are capable of determining the nucleon mass in the chiral limit, Mo, from pseudo-data in 
the pion-mass range m = 200-500 MeV. We follow this in Sec. |V] with a similar analysis of 
pseudo-data generated from an underlying function M^(m) that deviates from the xPT form 
above 500 MeV. Bayesian extractions of M from such pseudo-data generated in different 
ranges of m yield consistent results. This conclusion holds even for fit windows that extend 
significantly above the pion mass where the ;^PT from for M^(m) ceases to be valid. 

These three problems are variants of polynomial regression using Bayesian methods, a 
problem that has received extensive treatment in the literature. A Gaussian prior for the 



polynomial coefficients is a common choice [24j, |25j. However, many authors [2J, |26| also 
incorporate the 'commonly held belief that the [coefficients] will tend to decrease in absolute 
value as the order increases' (24| . The idea of marginalizing over the order of the polynomial 



fit is also not new. It is discussed extensively in studies of "Bayesian Model Averaging", 



e.g. Refs. [271 . |28| . In EFT applications the basis for regression includes non-polynomial 
functions and, as will be discussed extensively in Sec. IHff the fit form only has a limited 
region of applicability. These peculiarities make the problem of parameter estimation in 
EFTs one that has — as far as we can tell — evaded treatment in the Bayesian literature until 
now. 

On the EFT side Bayesian methods have been applied to chiral extrapolations of quenched 
lattice data on the nucleon mass as a function of the pion mass in Ref. [29]. In this work a 
technique based on using priors obtained from a subset of the lattice data in the analysis of 
the full data set was employed. Such iterative methods are frowned upon in the Bayesian 
literature [13]. Meanwhile, Trottier et al. [3(| have — among other applications — employed 
Bayesian methods to perform extrapolations of the static-quark self energy as a function of 
the lattice size L. The extrapolant in this problem can be computed using lattice effective 
field theory and Trottier et al. used "constrained curve fitting" to stabilize the fit of certain 
coefficients in that EFT expansion, thereby incorporating some prior information regarding 
the naturalness of coefficients in the EFT expansion in their fit. However, hadronic observ- 
ables, such as the behavior of the nucleon mass as a function of m, were not considered. 

This paper seeks to develop a general strategy for parameter estimation in EFTs: one 
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based only on the expectation that these parameters are "natural" with respect to the 
underlying scale. We believe that such priors provide a more stable and reliable estimation 
of coefficients than standard fitting techniques. This conclusion seems very general, and 
should apply to a wide variety of EFT situations. We discuss those conclusions in Sec. ED 
and give a sampling of possible applications. 



II. BAYESIAN PROBABILITY THEORY 

In this section we outline the way in which we will use data to estimate the parameters 
in an EFT. After a brief review of the standard maximum-likelihood technique we describe 
the basics of Bayesian probability theory and explain how it can be used for this problem. 
In particular we show how prior information on an EFT's low-energy constants can be 
systematically included in the data analysis, and how the impact of higher-order effects on 
the extracted LECs can be accounted for by marginalization. 

Throughout this section we denote a set of data on a particular observable by D = 
{{dk, crfc) '. k = 1, . . . , N}, with dk an individual measurement at point Xk and <j\~ the corre- 
sponding uncertainty. The functional form which we want to use to describe the observable 
is given by f(x, a), where f(x, a) depends on a set of EFT parameters a = {ao, . . . , «m} 
which we wish to determine from the data set D. In any EFT application (and in many 
others too!) f(x, a) need not and should not be assumed to be the correct functional form 
for the observable for all x. Instead we only assume that there is some x domain, x < p say, 
where / can be systematically improved via the addition of more terms (and hence more 
parameters), p would then be the breakdown scale of the EFT expansion represented by /. 



A. Maximum likelihood 

The maximum-likelihood method is often used to determine the unknown parameters a. 
In this method one tries to find those values of the parameters that maximize the probability 
of generating the data set D, assuming that f(x, a) is indeed the true theory, i.e. we seek 
to find a = ao such that 

w(D\a J) (2) 

is maximum, where 

W (X\Y) 

denotes the probability of X given Y. (Here and below the notation pr(D|a, /) is used 
to specify both the functional form f(x, a) as well as the particular values of a.) The 
maximum-likelihood method simplifies further if the data are independent and the noise 
due to measurement is Gaussian. In this case the probability of finding the data given the 
underlying functional form f(x, a) can be written as 



P^|a )/ ) = n(^)ex P ( 



A 2 



(3) 



where 

N / i r/ \\ 2 



fc=i 
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Finding the maximum of pr(D\&, f) is equivalent to minimizing x 2 , giving justification to 
the widely used method of least-squares. 

In fact, the least-squares likelihood of Eq. ([3]) is the least biased pdf in the case that the 
means dk and variances of the TV uncorrelated measurements are known. This statement 
can be proven via the "principle of maximum entropy" [33J which states that the least-biased 
pdf is found by maximizing 

(5) 

under the constraints of the available information, where m(x) is a 'measure' for the maxi- 
mum entropy calculation (see App. IA~j) . 



S = — / dx pr(x) log 



pr(x) 



m[x) 



B. Bayesian approach 

The above approach, while standard, has three shortcomings as far as the particular 
application we have in mind is concerned. The first is that maximizing Eq. is not 
exactly the problem we want to solve. It assumes that the theory, in particular the values of 
the parameters a, is given, while in fact the data are given and we want to infer values for 
a. Mathematically we are really interested in pr(a|D, /) (provided we assume a particular 
functional form / to describe the data). The question of which of these two probabilities we 
should try to maximize is related to a long-lasting discussion about the correct interpretation 



of probability, and we do not wish to comment on this here (see e.g. [13l . I31L |32| ) . As we 
will see below, the two probabilities are related by Bayes' theorem. The second issue is of 
a more practical nature. The maximum-likelihood approach assumes no prior knowledge 
of the values of the parameters a. While there are circumstances in which this is indeed 
appropriate, there are other cases in which information on the parameters a is available 
before the data analysis. This information could come from the naturalness arguments 
mentioned above, from symmetry arguments, or via constraints from other experimental 
data. Incorporating such knowledge into pr(a|£),/) refines the EFT parameter estimate 
obtained from the data set D, and is very easy within the framework of Bayesian statistics 
12| . Finally, the entire discussion thus far assumes a particular functional form for /. A 
more general approach would allow the extraction of relevant LECs from data using different 
EFT forms, for example obtained by carrying the EFT calculation to different orders in the 
m/A expansion. Such marginalization is straightforward once pr(D|a, f) is in hand. It 
amounts to standard manipulations of conditional probabilities (see e.g. jl2|). 

Bayes' theorem relates the probability of a certain parameter set being correct given a 
set of data D, pr(a|D, /), to the probability pr(D|a, /) of obtaining the data D given the 
theory f(x, a) with a specific a, 

pr(a|A/)- Pr(g|a : / n ) , P f f l/) . (6) 
pr(D|/) 

Here, pr(a|D, /) is referred to as the posterior pdf, while pr(a|/) is called the prior pdf and 
incorporates information on the parameters a that we have prior to analysis of the data. 
pr(£)|/) is the probability to find the data D regardless of the specific values of the a* and 
can often be absorbed in a normalization constant, yielding 

pr(a|A/)cxpr(D|a,/)pr(a|/). (7) 
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Here we will take the aj's to be dimensionless. In that case they are so-called "location" 
parameters, and if there is no prior information on a then pr(a|/) should be taken to be a 
constant [12J. Consequently one finds that 

pr(a|D,/)txpr(D|a,/), (8) 

which leads us back to the method of maximum likelihood described in the previous section. 

However, in the case that prior information on the parameters is available a constant 
pr(a|/) is not appropriate and one has to decide how to incorporate the available information 
in the prior pdf. This is not always straightforward, as can be seen by the example of 
naturalness that is of interest to us here. In that case the information we want to encode 
in pr(a|/) is that the parameters a are supposed to be natural, i.e. of order 0(1). However, 
neither of these statements gives us much guidance as to what form to choose for pr(a|/). 
For instance, one way to incorporate the fact that ~ 0(1) would be to assign a uniform 
prior in the region —5 < Oj < 5: 

«»W/> = { t "otherwise 5 ' < 9 > 

However, this is a very strict prior outside the range [—5, 5], and while LECs with magnitude 
larger than 5 might not be ideal for the convergence of the EFT it is not clear that they 
should be rejected entirely. 

We will incorporate naturalness in a less restrictive form. We choose the function f(x, a) 
to contain M + 1 parameters a«, and we assume only that / is linear in these parameters, 4 
i.e. we write 

M 

f(x,a) = ^2a j f j (x), (10) 

3=0 

where the fj(x) are basis functions, e.g. monomials of order j. We will refer to M as the 
order of the EFT calculation. So as to simplify the notation we replace e.g. pr(D|a, /) 
by pr(D|a, M), since specifying the order M usually defines / in a given EFT. We then 
interpret the naturalness assumption as a constraint on the ensemble average of the sum of 
squares of the coefficients f\T 

JdsLa 2 pi(a\M,R) = (M + 1)R 2 , (11) 

where R encodes our interpretation of what "(9(1)" means. We want to find the least 
informed pdf that incorporates the information in Eq. (|11|) . As discussed above this can 
be achieved by the application of the maximum entropy principle. Using the constraint of 
Eq. (jUJ we arrive at the prior pdf (see App. [X]) 




pr( a|M, fl )=(^ s l expl-— ), (12) 



4 With our definition observables are always linear in the LECs. However, for the standard definition of the 
term LEC this no longer holds. The techniques developed here can be extended to non- linear dependence 
on the parameters but that makes the analysis more complicated, and so we defer that case to a future 
study. 
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which is a multivariate Gaussian distribution with mean \i = and standard deviation R 
and can thus be written as 

pr(a|M, R) = (f[ -±A exp , (13) 



with 

M o 

af 



Xprior ^ ^ d2 ' (1^0 
i=0 

Note that since our testable information ( TTTT) did not include any statement about corre- 
lations between the LECs we have obtained a Xprior * n which the Oj's are uncorrelated. If 
correlations between different coefficients are known to exist then they should (and can) be 
part of the testable information provided to the maximum-entropy principle. 

Combined with a Gaussian likelihood function pr(Z)|a, /) the probability of finding a 
theory f(x, a) given the data D again assumes a Gaussian form, 

pr(a|AM,i?)aexp(-^y (15) 

where we follow Ref. 2(| and introduce an "augmented x 2 ", 

"X-aug X Xprior' 

(16) 

The expectation values of the parameters a, are then determined by finding the maximum 
of the probability pr(a|D,M, R) — which is equivalent to a least-squares problem with x 2 
replaced by xl ug - This form has the advantage that techniques developed for the stan- 
dard least-squares approach can be adopted, and certain manipulations can be performed 
analytically, as we will now demonstrate. 

The standard x 2 can De written in matrix form as 

X 2 = a T Aa - 2b ■ a + C, (17) 

where the (M + 1) x (M + 1) matrix A is defined as 

N 1 

A U = '52-2M x k)fjM, *,J = 0,...,M, (18) 

k=l 

the (M + l)-component vector b is given by 

- 1 

bi = 22-zd k fi{x k ), i = 0,...,M, (19) 

k=l U k 

and 

N 1 

C = J2-,4. (20) 



k=i °* 



The augmented x 2 can be written in similar form by simply replacing the matrix A by A 



aug 5 



A aug = A+^l, (21) 
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where X is the (M + 1) x (M + 1) identity matrix. The minimum of x\ U g * s then simply 
given by 



From these formulae it is easy to see how the size of R influences the result of the 
regression. As long as 1/R 2 <C \ m in, where X min is the smallest eigenvalue of A, it will 
have only very little effect on the extraction of a . In this case the constraint by the 
prior information is very weak. However, for the case that 1/R 2 is much larger than the 
smaller eigenvalues of A, i.e. R is small, the prior information of Eq. (TlTI) amounts to a 
strong constraint on the allowed parameter values and will dominate the solution ao. The 
augmented \ 2 is thus of most use when R 2 ~ 1/X min . The naturalness constraint can then 
help to refine and distinguish between what would otherwise be shallow and/or equivalent 
minima in the x 2 hypersurface. 

C. Marginalization 

In general the theory underlying the data can depend on a large number of parame- 
ters. When we are only interested in a subset of these parameters the other ("nuisance") 
parameters can be eliminated from the analysis by marginalization. 

Suppose that the theory depends on the parameter sets X and Y, where X stands for 
the parameters of interest while Y denotes all nuisance parameters. Standard probability 
theory tells us that the pdf pi(X\D) can be obtained by summing/integrating the probability 
pr(X, Y\D) over all possible values of Y, 



It is interesting to note a similarity between marginalization and effective field theory. In 
both cases one "integrates out" those degrees of freedom one is not explicitly interested in 
(Y and heavy degrees of freedom, respectively) and takes their contributions into account 
implicitly. As we shall now see, in the case of linear dependence of / on the parameters the 
similarity is particularly striking as the marginalization involves a Gaussian integral over 
the irrelevant degrees of freedom. 

Once the marginalized pdf is obtained, the problem of estimating the parameters X is re- 
duced to a lower dimensionality, thereby reducing the numerical cost. Finding the marginal- 
ized pdf has its own cost; however, for the case of a Gaussian posterior the marginalization 
integral can be performed analytically. We will be interested in the case where X stands for 
a subset of low-order parameters a res = (a , . . . , a r -i), and Y denotes higher-order param- 
eters Simarg = ( a r, ■ ■ ■ i o-m) with a = (a res , a marg ). We want to obtain the marginalized pdf 
pr(sL res \D, M, R), which is given by marginalization of the posterior of Eq. (fl5|) over SL marg , 




(22) 




(23) 




(24) 



As explained in the previous subsection we can write xl ug as 

xLa = a T A ans a - 2b • a + C. 



(25) 
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Performing the integration over the parameters a mar9 one again obtains a Gaussian pdf, 



pr(a res |D, M, R) oc exp 



(26) 



where T and /? are related to A aug and b by 

T = A 1 -A 2 (A 4 y 1 A 3 , (27) 
/5 = b res — h marg (A A ) 1 A 3 , (28) 



A au9 = [ A A ] ^ ) > (29) 



where b = (b res ,b mars ) and 

A -I 

A 3 Ai 

with Ax an rxr, A 2 an r x (M+l— r), A 3 an (M+l— r) xr, and A 4 an (M+l— r) x (M+l— r) 
matrix, respectively. The estimates for the parameters a res are now given by 



x res 



Y- l (3. (30) 



A straightforward calculation (see App. [B]) shows that these results for a res0 are identical 
to the ones obtained from the non- marginalized posterior pr(a|D, M, R). The first rxr 
entries in the covariance matrix are also unaffected. Therefore marginalization has no effect 
on the parameter estimates in the case that the posterior pdf is Gaussian. But, for a general 
posterior pdf, the estimates of the parameters a res after marginalization can differ from the 
ones obtained from the unmarginalized pdf. 

The marginalized probability pr(a res |M, R, D) still depends on M and R, i.e. the choice as 
to which order of the EFT expansion is used to obtain the fitting function and the meaning 
of what is really "natural" for the a,'s. Neither the exact order of the polynomial from which 
we estimate a res nor the exact value of R are of specific interest to us. Ultimately we are 
only interested in what the data can tell us about the value of a res , and the pdf of interest 
is really pr(a res \D), i.e. we want to eliminate M and R. We now show how the probability 
pr(sL res \D) can be obtained from the familiar likelihood pr(D|a, M) by marginalization and 



Bayes' theorem (also see Ref. [19| for marginalization over R) 



To construct pr(a res \D) we marginalize M and R over suitable domains: 



pr(a res | J D) = dR M*res, M,R\D). (31) 



M 

Using Bayes' theorem we can rewrite the right-hand side as 

M ma 



E 

M=r 



pr(D|a res ,M, J R)pr(a res ,M, J R) 

dR — . (32) 

-pr(D) 



In an Mth-order calculation with M > r — 1 there are additional parameters in the EFT 
function /. Thus, to calculate pr(D|a res , M, R), we introduce them by marginalization: 



pr(D|a res , M,R) = J da. marg pr(L>, a mar9 |a res , M, R) 

= J da mar g pr(Z)|a res , a mar9 , M, i?)pr(a mar9 |a res , M, R). (33) 
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Inserting Eq. (|33|) in Eq. (|32l) we find 

( f /" j pr(D|a,M,i?)pr(a mar Ja res ,M,#)pr(a res ,M,i?) 

pr(a res |-D) = 2_j J dR J d3L mar 9 ^(D) ' ^ ^ 

The first probability in the numerator should be independent of our choice of R, i.e. 
pr(D|a, M, R) = pr(D|a, M), and the last two terms in the numerator of Eq. (|34p can 
be rewritten as 

pr(a marff |a res , M, i?)pr(a res , M, R) 

= pr(a mar9 |a res , M, i?)pr(a res |M, R)pi(M, R) 
= pr(a|M, i?)pr(M, R) 
= pr(a|M, J R)pr(M)pr( J R), 

where in the last step we have used that M and R should be independent of each other. 
This gives as our final pdf 



Mma x l>R max r 

pr(a res |L>) = dR 



pr(D|a, M)pr(a|M, fl)pr(M)pr(fl) 



Eq. (1351) is a key result of this paper. Its derivation employs only Bayes' theorem and 
the standard rules of probability. It thus encodes, in a completely general way, an EFT 
fitting strategy that accounts for systematic differences in fits due to results obtained with 
different EFT orders. It can also incorporate the requirement that EFT parameters be 
natural. Furthermore, the integrals over a mars that appear tend to reduce the impact on the 
pdf of data points where higher-order terms in the EFT are large, and so Eq. (I35p includes 
the notion of "theoretical uncertainty" in the fitting procedure in a well-defined way. 

While Eq. ( 135]) is general the manner in which the naturalness requirement is implemented 
is open to interpretation. For the rest of this paper we will use the maximum-entropy prior 
(I12p for our analyses. Priors in M and R also need to be specified. We will let the sum 
over M run from r to some M max . In general one should try to ensure that the parameter 
estimates for a res are not sensitive to M max . (Technically this is an implementation of an 
"improper prior" on M via a limiting procedure.) We do not have any information that 
would lead us to favor one value of M over the other, and we therefore assign a uniform 
prior to M, 

pr(M) = . (36) 

M - M • + 1 

lvl max lvl min \ i - 

Similar reasoning applies to the prior on R. We integrate over some region R m in < R < 
R max . If the data analysis is being done in a sensible choice of units we would expect that 
values of R close to 1 will be favored, but we do not wish to bias the fit unduly in this 
regard, and so we choose a uniform prior. However, since R is a scale parameter the prior 
should be uniform in log(i?) (HI], not R, and we thus obtain 

MR) = \- (37) 

Therefore while pr(a|D, M, R) is Gaussian, the final pdf of Eq. fl35l) after marginalization 
over M and R is no longer of Gaussian form. This means that — unlike the case defined by 
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Eqs. (I24p and (125]) — estimates for a,.^ cannot be determined from a simple matrix multipli- 
cation. Instead, given the pdf pr(ci res |D) we calculate the expectation values and variances 
of the parameters a res , according to: 

W = j 'd^a^ M D), (38) 

= (4) ~ M 2 , (39) 

where we have assumed pr(& res \D) to be normalized. 

This reveals another advantage of the Bayesian approach: in addition to the uncertainty 
in the data, the variance a ai includes the uncertainties due to fitting at different M's and 
choosing different i?'s. These effects are included in the final pdf pr(a. res \D). Indeed, it 
is useful to think of Eq. (1551) as a weighted sum over the possible values of M, where the 
weight is given by the probability pr(M|D) (c.f. Ref. pjl). If one specific value of M is 
much more likely than any other, the pdf pr(a res |/}) is dominated by this specific term 
in the sum and our approach will yield approximately the same answers as a fit solely at 
that particular order. This is in contrast to much of the EFT literature where assumptions 
about M are often implicitly made in parameter estimation. Equation (1351) forces and 
allows such assumptions to be explicitly included in the extraction of LECs from data. A 
similar argument holds for R: if the pdf pr(i?|/}) has a spike near a particular value of R 
then the integral over R will be dominated by that value, and a fit with R fixed would be 
quite successful. The advantage of Eq. (1551) is that it makes no assumptions about whether 
special cases associated with such peaks in the M and R pdfs are realized or not. Instead 
marginalization lets the data (together with the minimal assumptions encoded in our priors) 
determine which values of R and M will be important in the extraction of the czj's. 



III. APPLICATION TO A TOY PROBLEM 

In the following we consider an example that allows us to illustrate the main features 
and advantages of Bayesian methods in fitting data in order to extract EFT parameters. 
Instead of real data from an actual experiment we choose to generate artificial data from 
the function 

g(x) = Q + tan 

for x > 0. While we are not aware of any physical quantity that is described by g(x), it 
exhibits several features that commonly appear in the analysis of data relevant to EFTs. 
The function g(x) is nonanalytic for xeK, but within a finite radius of convergence p it can 
be approximated to arbitrary precision by a power series. With the application to EFTs in 
mind we think of p as the "high-energy" scale. Therefore coefficients in the expansion of g in 
powers of x will be natural when written in units of p. Because of the particular argument 
we have chosen for the tangent function in Eq. (T4"0"j) the radius of convergence of the Taylor 
series for g(x) is p = 1, which simplifies the subsequent discussion. It has the consequence 
that the absolute values of the coefficients in the power series expansion of g(x), 

g(x) « 0.25 + 1.57a; + 2A7x 2 + 1.29x 3 + 4.06x 4 + ■ ■ • , (41) 

are "natural" for at least the first 10 terms, with an rms value of around 3. However it is 
interesting to note that the coefficients do not decrease with increasing order, and so priors 
based on the expectation that they do |24j could potentially lead to misleading results. 
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FIG. 1: Generated artificial data D\ (x max = 1/ir, circles) and D 2 (x max = 2/ir, triangles). For 
both data sets c = 5%. The solid line is the function g{x). 

We generate "data" that are normally distributed about the curve g(x) and we assign a 
relative error c at each value of x. The data are then given by 

y(xi) = g(xi)(l + crji) (42) 
Oi = cy(xi) (43) 

where the 7/j are random numbers that are normally distributed with mean f/ = and 
variance = 1... For the following example we generate two data sets D\ and D 2 , the 
first for < x < 1/tc and the second for < x < 2 /it, each containing 10 data points with 
c = 0.05. The data are shown in Fig. [I] and can be found in App. [C] 

Our aim is to extract the coefficients of a polynomial Jm{x, a) of degree M from a fit to 
the data, where 

M 

f M (x,a) = y^jX 3 . 

As can be seen in Fig. [5J the power series expansion of g(x) up to order 3 does not reproduce 
the complete function for x > 0.4, while at order 7 good agreement is found up to x ~ 0.6. 

In this section we explore three methods for extracting the coefficients a and a\ from 
the two different data sets shown in Fig. [TJ First, we review the results obtained via a 
standard maximum-likelihood fit at different orders. Then we examine the way in which 
the augmented x 2 obtained above can be used to improve those results. Finally, we show 
how marginalization over M and R retains the improvement seen due to the use of the x! U g> 
and deals with the sometimes awkward issue of which values should be chosen for those two 
parameters. 

A. Standard maximum-likelihood approach 

We begin with a standard maximum-likelihood analysis of the data. Since our data are 
normally distributed, this reduces to a ^-minimization problem as explained above. The 
first issue is the choice of the order of the polynomial. For the first data set with x < 1/tt a 
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FIG. 2: The function g(x) (solid line) and its power series expansion at order M = 3 (short-dashed 
line) and M = 7 (long-dashed line). 

polynomial of low order might result in values of the coefficients close to the ones in Eq. ( f4Ti) . 
But it seems clear from Fig. [2] that for data corresponding to larger values of x a fit at low 
(e.g. third) order will not give accurate results for the true coefficients in the power series 
of g(x). Conversely too high an order leads to an over-constrained fit. If, as is often the 
case, the function g(x) is not known we will not know what order in M is necessary in 
order to obtain reasonable likelihoods. So here we perform the analysis at several orders, 
M = 1, . . . , 7. We start with the data set Di for which < x < The results for the first 
few coefficients at each order together with the corresponding x 2 P er degree of freedom are 
given in Table fl] Surprisingly, the quadratic fit reproduces the underlying values of ao and 



M 


X 2 /d.o.f. 


a 




«2 


1 


2.24 


0.203 ± 0.014 


2.55 ± 0.11 




2 


1.64 


0.250 ± 0.023 


1.57 ± 0.40 


3.33 ± 1.31 


3 


1.85 


0.269 ± 0.039 


0.954 ± 1.094 


8.16 ± 8.05 


4 


1.96 


0.333 ± 0.067 


-1.88 ± 2.69 


44.7 ± 32.6 


5 


1.39 


0.566 ± 0.132 


-14.8 ± 6.85 


276 ± 117 


6 


1.85 


0.590 ± 0.291 


-16.4 ± 18.1 


311 ± 395 


7 


2.67 


0.242 ± 0.788 


8.97 ± 56.3 


-373 ± 1494 



TABLE I: Fit results for standard x 2 approach with x max = 1/vr and c = 0.05. 

a>i extremely well. It also has the lowest x 2 , and so there is a good argument for accepting 
this as the true value of the fit. However, if one did not know the underlying values of ao 
and a\ one might be hard put to explain the extent to which the fit at order 2 is superior to 
that at order 3, or indeed, that at order 5. The lack of convergence for ao and a\ is rather 
disturbing. 

This problem is exacerbated when we repeat the x 2 analysis with our second data set 
D 2 , for which < x < 2/tt. The results are given in Table ILT1 As expected, the low-order 
fits do not yield results in agreement with the true coefficients in the power-series expansion 
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M 


X /d.o.f. 


a 


ai 


a-2 


o 

I 


o cr 

o.oo 


U.392 ± (J. 1)33 


-U.387 ± U.351 


o no i n ron 

8.U8 ± U.689 


Q 
O 


1 47 


n 1 41 -4- n n^s 




19 7 + Q Q 


4 


1.48 


0.246 ± 0.106 


1.79 ± 2.35 


4.81 ± 15.4 


5 


1.46 


0.00697 ± 0.217 


8.67 ± 5.94 


-59.3 ± 53.2 


6 


0.46 


0.995 ± 0.516 


-24.0 ± 16.6 


319 ± 187 


7 


0.50 


0.180 ± 1.41 


5.98 ± 51.0 


-89.9 ± 685 



TABLE II: Fit results for standard \ 2 approach with x max = 2/tt and c = 0.05. 



of g(x). But, even at higher orders, the values for the first few parameters are not close 
to the underlying values and have large errors. With knowledge of the "true" results we 
can see that the M = 4 result is the closest, but there is no minimum in x 2 /d.o.f . at order 
4, and even the zeroth-order coefficient do does not reach a stable value as the order of 
the fit is increased. This is partly because a% and a 2 tend to much larger values than the 
expected natural size in this example. The best fit appears to arise from large cancellations 
between successive terms in the polynomial, and so standard \ 2 ls powerless to obtain useful 
information on a Q and ai from the data set D 2 . 

There are possible remedies for this problem, e.g. analyzing a subset of the data cor- 
responding only to small x, but these require judgement on the part of the practitioner. 
How small is small enough for a finite-order polynomial to be a reasonable approximant to 
the underlying function? Such judgements also clearly require a trade off between making 
x small enough that the polynomial is accurate and increasing x to add more data and so 
increase the statistical power. 

B. Bayesian approach at fixed M and R 

We will now show how a Bayesian analysis of these two data sets dramatically improves 
the parameter estimation. In particular, we will show how the requirement of naturalness 
stabilizes the fit. 

First we re-analyze the low- a; data at fixed M and R, i.e. employ the augmented \ 2 
derived in Eq. ffl~6l) . As explained in Sec. [Ill we use the prior of Eq. f fl~2l) . 

/ 1 \ M + 1 / a 2 \ 

pr(a|M -* )= ta) exp (-^)' 

which is a constraint not on individual parameters, but on the ensemble average: (a 2 ). 5 In 
this case we choose R = 1 and perform fits at M = 2, . . . , 7 using the data set D^. The 
results for the leading parameters and the corresponding probability are given in Table IHIl 
One immediately sees the influence of the prior on the results. The estimates of the first 
two parameters hardly change as M is increased and are much closer to the "true" values 
than in the standard \ 2 approach, with the exception of the excellent quadratic fit. It is also 
noticeable that the uncertainty on a\ and a>2 has decreased dramatically. However, none of 



5 Note that here, in contrast to Eq. (|38p . the integration is performed over all a. 
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AT E>\1 

M, r£JJ 


a 




0.2 


Z 


12.00 






z.Ub ± (J. 25 


l.bO ± 0.78 


Q 
O 


1 1 9^ 




U.ZiOU HZ U.Ulu 


9 OzL 4- 9^ 


1 .ou in u.iy 


4 


10.35 




0.230 ± 0.018 


2.04 ± 0.25 


1.49 ± 0.80 


5 


9.43 




0.230 ± 0.018 


2.04 ± 0.25 


1.49 ± 0.80 


6 


8.51 




0.230 ± 0.018 


2.04 ± 0.25 


1.49 ± 0.80 


7 


7.60 




0.230 ± 0.018 


2.04 ± 0.25 


1.49 ± 0.80 



TABLE III: Fit results for Bayesian approach with R = 1. x max = 1/tt and c = 0.05. 



M 


log[pr((a)|Di, 


M, R)] 


a 


o\ 


02 


2 


9.62 




0.248 ± 0.023 


1.63 ± 0.39 


3.15 


± 1.27 


3 


7.10 




0.247 ± 0.024 


1.65 ± 0.45 


2.98 


± 2.32 


4 


4.57 




0.247 ± 0.024 


1.64 ± 0.46 


2.98 


± 2.39 


5 


2.04 




0.247 ± 0.024 


1.64 ± 0.46 


2.98 


± 2.39 


6 


-0.488 




0.247 ± 0.024 


1.64 ± 0.46 


2.98 


± 2.39 


7 


-3.02 




0.247 ± 0.024 


1.64 ± 0.46 


2.98 


± 2.39 



TABLE IV: Fit results for Bayesian approach with R — 5, x max — 1/tt and c = 0.05. 



the extracted parameters lies within la of its underlying value. Eq. (HTl) shows that, with 
the exception of ao, the first few coefficients in the expansion are in fact all larger than 
1. This means that the naive choice R — 1 to express what we mean by "natural" is too 
restrictive, forcing our augmented \ 2 to a local minimum that is not related to the true 
values of the parameters. As R is increased, this modification of the x 2 hypersurface by the 
augmentation becomes less severe, allowing the fit to explore a larger domain of parameter 
space. The results for R = 5 demonstrate this, and are shown in Table IIVI Again, all three 
parameters considered here show very good convergence with respect to M. The central 
values now lie closer to the correct ones than before: all are within la. This is in part 
due to the increase in the uncertainties, particularly for a 2 , which exhibits sizeable errors. 
Choosing a less restrictive prior has allowed a wider range of parameter values, while at 
the same time the results also tell us that the data is not sufficient to determine a 2 and 
higher-order coefficients. As expected, if we continue to increase R our results go over to 
those of the standard x 2 > so f° r the augmented \ 2 to be a useful technique a reasonable 
value of R must be employed. 

Meanwhile Tables |V] and IVII show the results of a similar x 2 U g minimization at different 
orders for the data set D 2 for R = 1 and R = 5, respectively. We see that — in contrast 
to the unaugmented-x 2 — there is no need to throw away any of the high-x data in order 
to obtain a stable fit. The entire data set D 2 can be used to perform an analysis which 
converges with respect to the order M at which the expansion is truncated. Again, the 
results for ao and a\ obtained with R = 5 are better than those for which R — 1, with the 
results in Table IVII in agreement with the underlying values. No useful information on a 2 
can be extracted in either case. The results of Table M and IVII also show that this data set 
has less power to determine the LECs ao and ai than does the lower- a; data set of the same 
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log[pr((a)|L>2,-M,-«)J 


a 


ai 


«2 
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-26.10 




0.807 ± 0.283 


5.60 ± 0.55 
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4 


-13.17 


0.295 ± 0.029 


1.16 ± 0.29 


2.88 ± 0.74 


5 


-12.56 


0.292 ± 0.029 


1.23 ± 0.29 


2.71 ± 0.75 


6 


-12.85 


0.290 ± 0.029 


1.26 ± 0.30 


2.65 ± 0.75 


7 


-13.51 


0.289 ± 0.029 


1.27 ± 0.30 


2.63 ± 0.75 



TABLE V: Fit results for Bayesian approach with R = 1, x max = 2/ir and c = 0.05. 



M 


log[pr((a)|I>2 


M, R)] 


a 


ai 


02 


2 


-11.3 




0.386 ± 0.033 


-0.312 ± 0.347 


7.93 ± 0.68 


3 


-4.90 




0.268 ± 0.043 


1.96 ± 0.64 


-2.33 ± 2.52 


4 


-4.25 




0.242 ± 0.045 


2.30 ± 0.65 


-2.23 ± 2.52 


5 


-5.63 




0.239 ± 0.045 


2.28 ± 0.65 


-1.58 ± 2.56 


6 


-7.69 




0.240 ± 0.045 


2.24 ± 0.66 


-1.20 ± 2.59 


7 


-10.0 




0.241 ± 0.045 


2.21 ± 0.66 


-1.01 ± 2.61 



TABLE VI: Fit results for Bayesian approach with R = 5, x max = 2/ir and c = 0.05. 



statistical weight, D x . 

The reader might object that this procedure is guaranteed to lead to a good fit, as an 
increase in the number of parameters automatically means a better fit to the data. The 
Tables above show that this is not the case: the logarithm of the maximum probability 
at fixed order M peaks at a particular value of M (which is, not surprisingly, higher for 
D 2 than for Di) and does not continue to grow with M. Higher-order fits are not always 
more probable since there is a "phase-space penalty" in the pdf for introducing additional 
parameters into the fit 19[ . 

The method used to obtain Tables UTTllVIl is very close to that of Refs. 20|, |21[, which ad- 
vocate the extraction of energies and amplitudes from lattice results for hadronic correlation 
functions via a technique the authors call "constrained curve fitting". Constrained curve 
fitting is also based on Bayesian probability and the inclusion of prior information to refine 
estimates of the parameters of interest. There too maximum entropy results in a Gaussian 
prior pdf, which in turn gives a Gaussian posterior with a modified y 2 . Andj_as in the case of 

" 2l and Ref. M is an infinite 
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this Section, the underlying function considered in Refs. 
sum of terms, and to perform the fit it has to be truncated at a certain order, and then 
convergence with respect to that truncation is sought. The only real difference between the 
methods lies in the prior information employed. While both formulations lead to Gaussian 



prior pdfs, Refs. [20|, |2l|, |30J employ information on the mean and variance of the individual 
parameters, while in this paper only the less restrictive knowledge of an ensemble average 
is assumed. One would arrive at the prior pdf Eq. in the formalism of Refs. @, H S3 
by assigning /i = and a — R to each individual parameter. Therefore our method can be 
considered a special case of constrained curve fitting. However, if more detailed information 
on individual parameters is available, it can and should be employed. 
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The results of this section show that, as might be expected, the choice of R has an impact 
on parameter estimation. If the true values of parameters are not known, even the notion 
of naturalness does not give clear guidance on which R to choose, as both R = 1 and R = 5 
are natural. We therefore advocate marginalization over R over a suitable region, thereby 
taking into account all reasonable values of R. 

C. Bayesian approach including marginalization over M and R 

In the fits of the previous section the value of a 2 is not well constrained by data set D 2 , 
and neither data set is sufficient to extract useful information about the coefficients of the 
cubic (and higher) terms. Therefore from now on a 2 , a 3 , ...will be considered nuisance 
parameters. This will allow us to focus on the pdf for a® and a\. We adopt a res = (a , a\), 
and compute the pdf pr(a res \D) using Eq. (1351) . That pdf depends on the choice of the 
range of M values over which we marginalize. Above we advocated increasing M max until 
convergence with respect to that parameter was obtained. Convergence for this example 
(both D\ and D 2 ) is obtained by M max = 8, with the results for ao and a\ unaffected by the 
inclusion of higher orders beyond that in the marginalization over M. Such a result was to 
be expected given the peaks in the posterior probability as a function of M seen in Tables 
IIIIHVfi and the fact that the results for (ao) and (a x ) presented there are not altered as M 
is increased from 5 to 6 to 7 and so on. 

With this in mind we choose M min = 2, M max = 8, R m in = 0.1, R max = 10, and compute 
the pdf of Eq. (135]) using data set D\. The results for ao and a\ are 



The results here are consistent with those shown in Tables II I II and IIV] but we emphasize 
that (1441) and (l4"5j) include not only marginalization over M but also marginalization over 
R. Since the bulk of the contributions to the R integral come from R between 2 and 5 and 
the errors are quite R dependent the errors found in Eqs. (jUJ) and (l4"5j) are a little larger 
than those in Table 1111} but are smaller than those in Table IIVI Marginalizing over R thus 
incorporates the systematic uncertainty due to the ambiguity in which value of R to choose, 
and does so in a way that lets the data determine which values of R should dominate the 
marginalization integral. 

As a check on the stability of this procedure we have repeated the calculation choosing 
Rmax = 20, resulting in 



which is consistent with (1441) and fj4"5l) . While it might violate the strict principles of a 
Bayesian analysis, examining the results for dependence on R ma x and looking for a plateau 
like the one seen here serves as a way to check that one has made a safe choice for R ma x- 
(Note that there is no corresponding dependence on R m in, since the extremely restrictive 
priors corresponding to small R contribute little to the final pdf.) 

Obviously in most real applications the underlying values of the parameters are not known 
a priori. But, in this case we have the "true" values of 0.25 and ~ 1.57 in hand, and we see 



a = 0.239 ±0.021 
ai = 1.84 ±0.37. 



(44) 
(45) 



a = 0.239 ± 0.022 
ai = 1.83 ±0.37, 
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that the Bayesian result is a much better extraction of the values of ao and a\ than the ones 
obtained from a standard x 2 analysis of this data set (with the exception of the surprisingly 
good quadratic fit). This builds confidence in the marginalization strategy represented by 
Eq. ([25]). 

Applying Eq. fl3SJ to data set D 2 , again with M min = 2, M max = 8, R min = 0.1, R max = 
10, we obtain 

a = 0.241 ± 0.048 (46) 
ai = 2.23 ± 0.74. (47) 

These are again consistent with results obtained by minimizing xiug an d seeking convergence 
with M (as long as the choice of R is not too restrictive). Eqs. (JIB]) and fj4T|) reproduce the 
underlying coefficients in the Taylor series much better than does a standard \ 2 analysis of 
D 2 . The errors are larger than in the case of the D\ results, since the data extends to larger 
x, and so higher-order effects are more important. Another indication that the data set D 2 
does not determine the parameters as well as the set D\ comes from the observation of small 
residual i? max -dependence of a\\ with With R max = 20 we obtain 

a = 0.237 ±0.050 (48) 
ai = 2.28 ±0.80, (49) 



while R max = 40 gives 



a = 0.237 ±0.051 (50) 
ax = 2.30 ±0.82. (51) 



While ao does not depend on R ma x, the dependence of a± and its error on this choice shows 
that the data in set D 2 cannot determine the central value to better than two digits, or the 
standard deviation to better than one digit. 

We have also extracted the pdf based on only a subset of the data set D 2 . We chose its 
first 5 data points, which span the range < x < 1/tt, and denote this by D 2 . (This is 
not the same data as in set D\, which contains 10 data points.) The results for M min = 2, 
M max = 8 and R min = 0.1, R max = 10 are 

a = 0.238 ± 0.038 (52) 
ax = 2.12 ±0.49. (53) 

We consider it a strength of our technique that we can analyze either a subset or the 
full data set and obtain consistent results. Comparison of the analyses using D 2 and D 2 
shows that a and cti are mainly determined by D 2 . (This conclusion is also apparent from 
the fact that the standard deviation for ao in Eq. (|46|) is significantly larger than the naive 
cj y/~N.) But the Bayesian techniques allow us to use all the available data without prejudice 
as to which x's are 'too large' for an EFT calculation at a given order M to be reliable. 
Indeed, data on intervals [0, x max ] with x max quite close to p can be employed for parameter 
estimation. One might be concerned that this will lead to the use of data with x > p, but, 
at least in the example considered in this section, including such data in the fit leads to 
a ridiculously small posterior pdf pr(& res \D). The probability of the parameters given the 
data goes through an abrupt drop due to the singularity that cannot be reproduced by the 
"EFT" polynomial form. 
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Thus Bayesian methods should allow EFT practitioners to perform reliable parameter 
estimation without undue concern about whether data is safely within the region of validity 
of the EFT. If data that is beyond the reach of the EFT is employed for the fitting the 
behavior of the posterior pdf pr(a res |Z}) will provide signals of the theory's breakdown. An 
important caveat is, however, that in many examples (e.g. that of Sec. |V} the physical 
quantity in question will not have a pole on the real axis when the radius of convergence is 
reached, and so the drop in pr(ct res |D) will not be as dramatic as it is here. 

IV. NUCLEON MASS IN CHIRAL PERTURBATION THEORY 

Having demonstrated the key features of our Bayesian approach using a toy model in 
the previous section, we now turn to an application in an actual EFT, chiral perturbation 
theory. In particular we will explore the chiral expansion of the nucleon mass in SU(2) xPT, 



which can be written as (see e.g. [35j, |36|, |37|, |38j) 



(TTl \ 
— J + k^m 4 + k 5 m 5 log 

—J + k 8 m 6 log ( — J + k 9 m 6 + 0(m 7 ). (54) 

Here M is the nucleon mass in the chiral limit and m denotes the lowest-order pion mass. 
The coefficients ki in Eq. (1541) are linear combinations of the coefficients of the operators 
that appear in the Lagrangian. Specific expressions for the LECs fc, in terms of these 
coefficients can be found in Ref. 38| and we do not reproduce them here. As stated in the 
Introduction it will be the fcj's that we focus on fitting, specifically M , the nucleon mass in 
the chiral limit, and k\, which is proportional to the nucleon sigma term. 

With the notation of Eq. ( 1541) the LECs ki are dimensionful quantities, with increasing 
inverse powers of some energy scale. We now rewrite Eq. (154"1) to make that scale explicit: 



M xPT (m) M h 2 k 2 3 k 3 4 (m\ k A 4 k 5 5 f m 
+ -r^m + — m + -rjm log — + -rjm + -r^m log 



A A A 2 A 3 A 4 & V W A 4 A 5 V A* 



+ TT m + T7 m log — + T7T m log — + T7 m + • • • , (55) 

A 5 A 6 V A* / A 6 V A* / A 6 v ' 

where we have also rescaled the nucleon mass since here we restrict ourselves to theories with 
only one high-energy scale. A critical question for any attempt to use Eq. (1551) to analyze 
data on the behavior of M as a function of m is now: for what scale A are the dimensionless 
fcj's of Eq. (j55"|) natural? 

One might expect that they would all be natural with respect to the nominal breakdown 
scale of xPT, A = 4ttF, with F the pion decay constant. But several of the fcj's are 
significantly larger than is indicated by naive- dimensional analysis with respect to this scale. 
For example, the coefficient k§, which defines the leading non-analytic contribution at the 
two-loop level, is given entirely in terms of low-order coefficients: 

h = W24^ (16 ^ " 3) = WFY 1 • (56) 
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Strictly, g A , the axial coupling of the nucleon, and F both take their chiral-limit values here. 
But the difference between physical values and chiral-limit values is a higher-than-fifth-order 
effect, and using physical values for evaluation we obtain k 5 of roughly 85, if A = 4ttF. 
A second example is the coefficient fc 4 , which is often written as 

fcj = ^- Jf% (g ^ C2tfo) ' (57) 

(2) 

where C2 is a coefficient in the second-order pion- nucleon Lagrangian, C^^, and the contri- 
bution e.\ stems from C^. 6 A fit to lattice data on M N {m) using the fifth-order x?T form 
obtained i\ = —30.5 GeV -3 , 7 which is perhaps not surprising given a k$ of 47.56 GeV -4 [39]. 

Here we choose e.\ = —8 GeV -3 which is comparable in magnitude to the value extracted 
in Ref. [39]. (This value is also not unreasonable give n that e\ = 16e 38 + 2en 5 + 2en 6 in 



terms of the coefficients in the Lagrangian of Ref. [4l|], if we assume that the operators in 
that C^jf are the ones whose coefficients are 0(1).) This, together with the physical values 
of g A and F and the values of the LECs ci, Co, Cs as found in Ref. |42j, as well as the mesonic 
LECs from Ref. [3] and die = —1-93 GeV -2 |43| (all evaluated at the scale \x = M ) yields: 

h = 3.84, k 2 = -5.63, k 3 = 6.49, h = 8.28, k 5 = 47.56, k 6 = 70.53, k 7 = 1 2.8, (58) 

where each kt is expressed in appropriate units of (GeV) _n . The values of kg and k 9 depend 
on a number of unknown higher-order LECs, and for the demonstrative purposes of this 
section we set them to 

fc 8 = 10, A: 9 = -100. (59) 

so that when we supplement M x prp(m) by a model at higher m in the next section we obtain 
a smooth function for M^(m) over a range of m up to 1 GeV. 

The value we choose for M Q is 0.88 GeV, but this particular choice is not relevant to 
the success or failure of our parameter-estimation strategy, since we will extract it from our 
pseudo-data using Eq. (135]) . This pseudo-data is not data from lattice QCD, but instead 
consists of 11 data points produced using the formula (154")) and the constants (158]) in the 
region between m = 200 and m = 500 MeV. We added normally-distributed offsets of mag- 
nitude 1.5% to the underlying form (1341) .This error is quite conservative given the precision 



reached in modern lattice calculations (see, e.g., the compilation in Appendix B of Ref. (44|). 
In the following we take the basis functions in Eq. (1351) to be 

fo(x) = 1; f 2 (x) = x 2 ; f 3 (x) = x 3 ; f Aa (x) = x 4 log(x), f 4b (x) = x 4 ; 

/6a 0*0 = x 5 log(x), fsb(x) = x 5 ; f 6a (x) = x 6 log 2 (x), f 6b (x) = x 6 log(a;), f 6c (x) = x 6 ; 

(60) 

with x = m/A. Initially we choose A = 1 GeV. We consider calculations at various different 
chiral orders, not with various different numbers of basis functions, e.g. the three functions 
f&ai feb, and / 6c are all added to the fit together when we go from order P = 5 to order 



6 There does not seem to be a consistent notation for this coefficient, as it is also referred to as or 
simply e, and sometimes also defined with the opposite sign. 

7 The actual value of e\ depends on the choice of renormalization scale fi, which in Ref. [39( was taken to be 
[i = mjj tys . Here we are not concerned with its actual value, but just its approximate magnitude, which 
is only logarithmically dependent on \i. 
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FIG. 3: Data generated from Eq. (|54p with 1.5% error. 



Rmax 


M (GeV) 


ki (GeV" 1 ) 


10 


0.917 ± 0.027 


1.98 ± 0.73 


20 


0.916 ± 0.026 


1.98 ± 0.74 


40 


0.916 ± 0.027 


1.98 ± 0.74 


80 


0.916 ± 0.027 


1.98 ± 0.75 


160 


0.916 ± 0.027 


1.98 ± 0.75 


240 


0.917 ± 0.026 


1.97 ± 0.74 


400 


0.917 ± 0.026 


1.98 ± 0.73 



TABLE VII: Dependence of estimated parameters on R max - The fit was performed with m = 
200 — 500 MeV and A = 1 GeV. Marginalization is over chiral order P = 3 — 6 with R m in = 0.1. 
The data were generated with M = 0.88 GeV and h = 3.84 GeV" 1 . 
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P = 6, although each functions have an independent coefficient. We use a modified version 
of Eq. f[3l)j) that takes account of this distinction between chiral order and number of ba- 
sis functions. In order to demonstrate our general method we continue to use the prior of 
Eq. ( |T2|) in this first study of the problem. We note, though, that %PT does provide addi- 
tional information on most of the coefficients of the basis functions which are non-analytic 
in the quark mass (see, e.g. Eq (|56|) ). and that in future studies such information could be 
used to refine the prior on the fcj's. 

We focus on the first two parameters, which are Mo and k\. Marginalization is performed 
from Pmin = 3 to Pmax = 6. Rmin is chosen to be 0.1, and we vary R max to produce the results 
shown in Table IVHI We do not see any appreciable R ma x dependence in the results. This 
lack of R m ax dependence might lead one to trust these as reliable LEC extractions, but in 
fact the central value for M is slightly more than la from the "true" value (M = 0.88 GeV), 
while the discrepancy for h\ is more than 2a (h = 3.84 GeV" 1 ). 

We attribute this failure to the fact that several of the LECs used to generate our pseudo- 
data are not natural with respect to the scale A=l GeV. As discussed above, the scale A 
at which the low-energy constants of the EFT are natural must be specified in order to use 
our "naturalness prior", i.e. the units in which the LECs are expected to be 0(1), must 
be chosen. Several of the LECs in Eqs. fl58|) and (159]) are not 0(1) for A = 1 GeV. The 
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7~> 


Mq [ijev ) 


K\ (Lrev J 


1 


qi _l n no 


9 11 +1 ^fi 


20 


0.91 ± 0.03 


2.11 ± 1.35 


40 


0.91 ± 0.03 


2.13 ± 1.37 


80 


0.91 ± 0.03 


2.10 ± 1.36 


160 


0.91 ± 0.03 


2.08 ± 1.33 



TABLE VIII: Dependence of estimated parameters on R max - The fit was performed with m = 
200 - 500 MeV and A = 500 MeV. Marginalization over chiral order P = 3 - 6 and R min = 0.1. 



marginalization over R is not sufficient to compensate for such misidentification of the EFT's 
underlying scale. While — especially for low-order coefficients — trade-offs between A and R 
are possible, a poor choice for A means that higher-order coefficients grow without bound. 
Changing A amounts to a change of units and so makes no difference to fits using the standard 
X 2 , but choosing A too big means that higher-order coefficients make disproportionately 
large contributions to x^rior- Correct identification of A is thus key to reliable parameter 
estimation using the formula fl35|) . In practice this choice of A then supplies part of the prior 
used to produce the posterior pdf. From now on we make this dependence on the underlying 
scale explicit by writing the posterior pdf as pr(a res |D, A). 

The literature on baryon x?T together with the observed coefficients in the expansion f[5"4"|) 
suggest a scale A = 0.5 GeV is reasonable. ;\RT may cease to be useful for m significantly 



below 500 MeV, e.g. the 350 MeV identified in Refs. [ll|, |39|, |40| , but with A = 500 MeV as 
the formal radius of convergence of the ;^PT expansion for MAr(m) the coefficients considered 
here, while still a bit large in some cases, are 0(1): 

M = 1-76, h = 1.92, k 2 = -1.41, k 3 = 0.81, h = 1-03, , , 

k 5 = 2.97, k 6 = 4.41, k 7 = 0.4, k s = 0.31, k 9 = -3.12. ^ ' 

Repeating the analysis of the data displayed in Fig. [3] with the set of basis functions ( 1601) . 
this time defined with A = 500 MeV, we again marginalize over P m in = 3 to P ma x = 6 with 
Rmin = 0.1. We obtain the results shown in Tab. IVIIIt where for the convenience of the 
reader we have converted the results back to units that are powers of a GeV. 

As in the case A = 1 GeV that is presented in Tab. IVIH we see only very weak R max 
dependence. Indeed, our analysis of the integrals over the parameter R shows that the 
main contribution to the results presented here stems from the region R w 1 — 2. This is 
reassuring, since it suggests that naturalness is compatible with the data if A is chosen to 
be 500 MeV. Also reassuring is that the lowest-order parameter, Mq, now agrees with the 
underlying value at the la level. But the central value of k\ found by fitting the pseudo-data 
is still a bit more than la below the underlying hi, with both the error and the central vale 
of ki increasing noticeably compared to the values presented in Table IVHI 

This trend persists as A is lowered further. In Tab. [IX] we show results for M and k\ at 
several values of A. Since the R ma x dependence is very small for all considered values, we 
only show results obtained with R max = 10. The extracted values for M do not change for 
A below 0.4 GeV, and agree with the underlying value there. The parameter ki, on the other 
hand, does not show any plateau in the considered range of A: its central value increases 
monotonically as A decreases. Once A < 0.4 GeV the extracted k\ agrees (within the la 



23 



A ( ^\ T\ 

A (LieVJ 


Mq [Lxev ) 




1 


n no i n no 

0.92 ± 0.03 


1.98 ± 0.73 


u. o 


u.c/i in u.uo 


9 1 Q _|_ 1 Qfl 


0.4 


0.90 ± 0.04 


2.50 ± 1.43 


0.35 


0.89 ± 0.03 


2.72 ± 1.34 


0.3 


0.89 ± 0.03 


2.88 ± 1.21 


0.25 


0.89 ± 0.03 


2.99 ± 1.09 



TABLE IX: Dependence of estimated parameters on the scale A. The fit was performed with 
m = 200 — 500 MeV, marginalization over chiral order P = 3 — 6 and R = 0.1 — 10. 



p 


M (GeV) 


fci (GeV- 1 ) 


X 2 /d.o.f. 


3 


1.12 ± 0.025 


-7.60 ± 0.63 


4.13 


4 


1.02 ± 0.29 


-16.4 ± 33.8 


0.95 


5 


4.37 ± 7.07 


-987 ± 2303 


1.24 



TABLE X: Results of standard least-square fit at different chiral orders P for A = 1 GeV. 



error) with the underlying one. It is interesting to see that the error on ki is largest at 
A = 0.4 GeV and decreases for smaller values of A. However, this behaviour is peculiar to 
k\. if one tries to extract higher-order parameters from the data their errors behave ~ 1/A n , 
where n is the dimension of the LEC in question. In the case that the underlying scale of 
the EFT is not well known, a study of the A dependence of the extracted LECs might be a 
useful check to see whether the results are reliable. Viewed in that light, Table ITXl suggests 
that, absent specific assumptions regarding A, the only conclusion we can draw about k\ 
from these data is that it is larger than 0.8 GeV -1 and smaller than 4.1 GeV -1 . 

Since this nucleon-mass data set gives only a weak constraint on k\, we now marginalize 
over it too. The resulting Mq extraction does not show any A dependence in the range 
A = 300 - 500 MeV. We find: 

M = 0.91 ± 0.04 GeV. (62) 

This agrees with the results shown in Table IIXI where marginalization over k\ was not 
performed, and is consistent with the underlying value of M = 0.88 GeV. 

We have also performed a standard least-squares fit to the data at several orders. The 
results are shown in Table IXl The result with the best % 2 /d.o.f. gives an acceptable value 
for Mo, although with a much larger error than is obtained from the Bayesian approach. 
That result is not, though, stable with respect to the order of the fit. The standard x 2 ^ so 
shows that k\ cannot be usefully constrained from these data. 

V. FITTING NUCLEON-MASS DATA BEYOND xPT'S DOMAIN OF VALIDITY 

The fits performed in the previous section had the advantage that the fit function included 
all of the terms in the underlying function used to generate our pseudo-data. In this section 
we examine what happens when — as in Sec. IHH — the EFT form breaks down at m — A, and 
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goes over to some other function which cannot be written in terms of the basis functions used 
in the fitting procedure. This transition to non-EFT dependence of the physical quantity on 
the independent variable may be smooth: one does not necessarily expect a singularity in 
the quantity of interest as the EFT's breakdown scale is crossed (c.f. Sec. IIIip . Nevertheless, 
we shall show that the tools used to analyze the toy-model data of Sec. II II I allow consistent 
fits across the boundary at A, and permit diagnosis of the situation in which the fit includes 
too much data that is outside the EFT's domain of validity. 

For our analysis we modify the function used to generate artificial nucleon-mass data by 
smoothly "turning off" the chiral dependence and "turning on" a model dependence for pion 
masses above a scale A: 

M N (m) = M xPT (m) (l - g (J^Jj + M model (m)g , (63) 

where g is a smooth function obeying g(0) = and lrnx^oo g(x) = 1. In order to have xPT 
be valid below the scale A but breakdown at A one should choose g(x) to have a Taylor-series 
expansion about x = with a radius of convergence of 1. In order not to disturb the terms 
up to sixth order in M x pr(m) and the terms there that are non-analytic in the quark mass 
(odd in m) one should also demand that g be even in x and have vanishing second-, fourth- 
and sixth-order coefficients. These requirements are satisfied by 

2 

g{x) = — arctan(x 8 ). (64) 

7T 

In the previous section we argued for a value of 500 MeV for the underlying EFT scale A, 
and that is what we choose in the form ( |63l) here. As in Ref. (45| the function M mo & e \{m) is 
chosen to have the correct heavy-quark limit: 

M mode i(m) = a + (3m. (65) 

The resulting Mjv(m) contains linear dependence on the pion mass m which is not present 
in the functions with which we analyze the data. 8 We select parameter values 

a = 1 GeV; = 1, (66) 

to give a smooth transition around m — A. The resulting functions M x pp(m) and Mjy(m) 
are shown in Fig. |H 

We generate several data sets between 200 MeV and a varying m max , each with 11 data 
points. We choose A = 0.5 GeV and statistical uncertainties of 1.5%. In our analysis of the 
resulting pseudo-data we marginalize over P = 3 to P = 6 and set R m in = 0.1. 

We begin by presenting results for m max = 350 MeV, which places all data far enough 
below A that we expect the non-chiral piece of Mjv(m) will not play a large role. The central 
values and la errors on M and k\ for different i? m «'s are shown in Table IXII As was the 



Ref. [451 ] argues that ~ m q at quark masses below the charm-quark mass, so the form (|65|) would not 
become appropriate until much higher m q . For our purposes the key point is that the linear- in-ra term 
is not present in our fit function M x py(m), and so its appearance means that the xPT expansion has 
definitely broken down above m = A. See also Ref. [3] where a linear fit to M(m) works to surprisingly 
low m q . 
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FIG. 4: The functions M x PT{m) (dashed line) and Mjv(m) (full line). 



Rmax 


M (GeV) 


fci (GeV" 1 ) 


10 


0.91 ± 0.04 


1.61 ± 1.86 


20 


0.91 ± 0.04 


1.61 ± 1.88 


40 


0.91 ± 0.05 


1.62 ± 1.90 


80 


0.91 ± 0.04 


1.61 ± 1.89 


160 


0.91 ± 0.04 


1.59 ± 1.85 



TABLE XI: Dependence of estimated parameters on R ma x for data generated with Eq. (|63|) . The 
fit was performed with m = 200 — 350 MeV and A = 500 MeV. Marginalization over chiral order 
P = 3 — 6 and R m in = 0.1. 



case in Sec. HVl we find that the extracted LECs are largely independent of Rmax- Their 
values are consistent within the errors with those obtained from fitting the "purely chiral" 
data in the mass range 200 MeV < m < 500 MeV. 9 At these low pion masses we do not 
gain much of a constraint on k\ from a data set consisting of 11 points with 1.5% errors. 

Next, we consider the effects of the transition across m = A in the fit. For this exercise we 
choose Rmax = 10 as this is in agreement with our knowledge of the the underlying values of 
the LECs (we have checked that our results have only minor Rmax dependence). Table IXLTI 
then shows the results for several values of m ma x- Note that since we employ Eq. flBSJ) to 
generate the data, the high-m data points start to deviate from the form M x pr(m) starting 
around 450 MeV, as can be seen in Fig. HI The results show a drop in the probability once 
data sets with m ma x larger than 500 MeV are considered. Indeed, the maximum probability 
obtained with the data set that extends to m max = 1 GeV is more than on order of magnitude 
smaller than that found in the case of m max = 500 MeV. ;^PT is, unsurprisingly, less likely 
to be the theory that describes data sets which extend beyond 500 MeV. 

As expected, k\ is only weakly constrained by any of the fits that have reasonable values 



9 They also agree very well with the LECs extracted from a "purely chiral" data set covering the pion-mass 
range 200 MeV < m < 350 MeV. 
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m max (MeV J 


Mq (Lrev j 




i „ „, r__ ,. / _ 1 t~\ n jf AM 

log [pr(a res \D,M,A)\ 


OCA 

350 


U.yl ± (J.U4 


i i i i or 1 
l.Ol ± 1.50 


1.97 


zinn 


n Q9 -u n c\A 

u.yzi zrz L/.u^i 


1 U J. 1 u 
±.o*± ZIZ _l.o*± 




450 


0.92 ± 0.03 


1.35 ± 1.27 


2.71 


500 


0.93 ± 0.03 


1.11 ± 1.07 


3.01 


700 


0.95 ± 0.04 


0.50 ± 1.27 


2.12 


1000 


0.88 ± 0.03 


2.43 ± 0.94 


-0.03 



TABLE XII: Dependence of estimated parameters on m max for data generated with Eq. (I63p . The 
fit was performed with m = 200 — m max MeV and A = 500 MeV. Marginalization over chiral order 
P = 3-6and« = 0.1-10. 



mmax (MeV) 


M (GeV) 


log [pr(a re .|D,Af,A)] 


350 


0.91 ± 0.04 


2.14 


400 


0.92 ± 0.03 


2.41 


450 


0.92 ± 0.02 


2.55 


500 


0.92 ± 0.02 


2.25 


700 


0.94 ± 0.04 


1.58 


1000 


0.88 ± 0.03 


2.34 



TABLE XIII: Dependence of estimated parameters on m max for data generated with Eq. (I63p . 
The fit was performed with m = 200 — m max MeV and A = 500 MeV. Marginalization over chiral 
order P = 2 - 6 and R = 0.1 - 10. 



of log [pr(a res |D, M, A)]. The results for M are relatively stable even when a sizable part of 
the data is generated with m > A, although in a few cases the underlying value is outside 
the la range. 

Marginalization over k\ does not improve the situation. Results of extractions of M 
alone for data sets with different m max are shown in Tab. IXIIII Once again, some of the 
extracted values are more than la from the underlying value. We attribute these failures to 
reproduce the underlying value to the fact that A = 500 MeV does not yield LECs in the 
chiral expansion of Mjy that are particularly natural. Investigation on how the parameter 
extraction changes as we vary the scale that appears in the naturalness prior is ongoing. 

For the fit with m max = 700 MeV five of the eleven data points are above A. In order to 
examine what happens if we fit a subset of these data we have used the first 6 data points 
in that set (those with m < A) and found: 

M = 0.93 ± 0.04 GeV, 
fci = 1.17 ± 1.38 GeV" 1 . 

The result is consistent with that obtained using the full m max = 700 MeV data set, indi- 
cating that the fit to those data is driven by the lower-mass points. 

We have used different functional forms to model the high-m behavior, such as negative 
values for the slope in the linear model of Eq. fl65|) and a quadratic dependence on the pion 
mass. The results of these investigations are qualitatively the same as those presented here. 
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We therefore conclude that the formula (|35|) can be used, together with a %PT fit form, 
to obtain the pdf pr(a res |D, A) from data on Mjsf(m) that deviates from the prediction of 
%PT above m = A. Fits of data over different intervals in m — including intervals extending 
beyond A — yield consistent results for M and k\. When the fit extends torn > A the result 
is driven by data that lie in the chiral regime unless such high m's are considered that the 
fit becomes exceedingly unlikely. However, if the results extracted of M and k\ are to be 
in agreement with the underlying values of these parameters, a value of A that corresponds 
to LECs of 0(1) must be used in the application of the nautralness prior. 



VI. CONCLUSIONS 

When used for the extraction of effective-field-theory low-energy constants from data 
standard methods for parameter estimation require some art in their application. Only data 
that are within the domain of validity of the EFT should be used for this purpose, and 
the process of deciding what that domain is may require iteration: the data used for the 
fit must be successively trimmed to yield a reasonable result for LECs. Furthermore, one 
of the criteria for a "reasonable result" for the EFT fit is that there should not be large 
cancellations between different orders in the EFT expansion, and that the resulting EFT 
LECs should be natural. 

In the Bayesian framework all of this information is put into the analysis from the be- 
ginning. Indeed a strict Bayesian would incorporate all the requirements mentioned in the 
previous paragraph in the prior pdf and then report only the resulting posterior pdf. The 
fact that this posterior pdf is precisely the quantity that those trying to obtain LECs are 
interested in, pr(a|D, /), makes Bayesian methods and EFT a very good match, as does 
the fact that in both EFT and Bayesian approaches unwanted degrees of freedom are "inte- 
grated out" of the calculation, leaving one free to focus on the entities that determine the 
low-energy dynamics. 

One straightforward way to employ Bayesian methods in EFT parameter estimation is 
through minimization of the augmented x 2 , a procedure that has been dubbed "constrained 



curve fitting" and advocated for the analysis of lattice correlation functions in Refs. [20l . |21 
In Secton[TT]we showed how a "naturalness prior" obtained using the principle of maximum 
entropy leads straightforwardly to an augmented x 2 . In Sec IIIIBI we showed, via a toy 
problem, that the use of the prior information stabilizes the fit of EFT LECs with respect 
to the order, M, at which the extraction is done. It also refines the parameter estimates, 
resulting in smaller errors — especially on parameters beyond do- 

The parameter R that encodes the ambiguity in the notion of "(9(1)" coefficients is an 
input to the naturalness prior. An approach that is more general than constrained curve 
fitting focuses, not on the pdf pr(a res |D, R, M) that was computed in Sec. IIII B[ but on 
pr(sL res \D). This requires marginalization over M and R, which can be straightforwardly 
done using standard rules of probability and Bayes' theorem, yielding Eq. (135|) : 

M m ax l>R maI 



pr(& res \D) = dR d& marg 

M = r Rrnin 



pr(D|a, M)pr(a|M, i?)pr(M)pr( J R) 
pr^D) ' 



While similar in general spirit to constrained curve fitting, analysis of an EFT problem using 
Eq. (1351) could be considered more general. Eq. (1351) accounts for marginalization over the 
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order M, and thereby includes the uncertainty due to the truncation of the fitting function. 
It also requires only very weak assumptions about the value of R. 

We then applied these methods to the chiral expansion of the nucleon mass, Mjv, as a 
function of the pion mass m. We analyzed pseudo-data generated from both the xPT form 
of and a form which included additional physics at m > 500 MeV. We did this by 
fitting the value of all coefficients in the chiral expansion up to terms of 0(m 6 ). In both 
cases we found that the nucleon mass in the chiral limit, M , could be determined with 
reasonable precision by data in the mass range m = 200-350 MeV. Using data at higher 
values of m — including values above the breakdown scale of — produced results that 
were consistent with the fit results from this low-m data set. The maximum likelihood of 
the fit also dropped significantly once %PT was used to extract Mq and k\ using data that 
extended well beyond the theory's domain of validity. 

The formula (|35|) with the prior f[T2l for pr(a|M, R) assumes dimensionless coefficients. In 
any practical EFT calculation a scale A must be chosen to absorb the increasingly negative 
mass dimension of the higher-order LECs. Large (in units of GeV~ n ) coefficients in the 
function Mjv(m) emphasize the importance of this choice. In general the posterior pdf 
obtained from Eq. (|35|) depends on the units used once the "naturalness prior" is employed, 
and so the posterior pdf should be written pr(a res |D, A). We analyzed one set of pseudo- 
data on Mjv(m) choosing a range of A's. The value of Mq does not change for A's below 
400 MeV. In this range the value of k% obtained from the fit was consistent (at the la level) 
with the underlying value of this parameter used to generate the pseudo-data. We conclude 
that the nucleon-mass data used here were not sufficiently accurate to determine k\ unless 
some prior knowledge of the underlying ;^PT scale was supplied. 

An alternative way to perform the analysis of nucleon-mass data would be to assume that 
the coefficients of the terms in Eq. (154]) that are non-analytic in the quark mass (i.e. contain 
odd powers or logs of m) are known. The simplest way to perform that analysis would be 
to subtract the non-analytic pieces from the data, and fit the remaining terms: 

M analytic (m) = M + k x m 2 + fc 4 m 4 + k 9 m 6 (67) 

to the resulting points. This, then, is simply a polynomial regression, where the application 



of Bayesian methods has been well discussed in the literature [24J, [25|, [26J]. Here we instead 
fitted all coefficients, although we marginalize k n for n > 1. This also means that our results 
take account of the issue that some non-analytic terms in the chiral expansion of the nucleon 
mass are not well-constrained by other data. It would be interesting to develop priors that 
improve upon the "naturalness prior" ([T2l by incorporating what is known about these 
coefficients in the nucleon-mass fit. 

With the groundwork for a Bayesian analysis of data on M^(m) in place it would appear 
very worthwhile to apply the methods developed here to actual lattice data on this quantity, 
see e.g. Refs. HEi and references therein. Because of its ability to build in prior infor- 
mation on LECs, including information on their correlations, our method could serve as an 
alternative to approaches like bootstrapping of the data, as done e.g. in [46j. Our method 
could also be extended to the case of a glob al fit of different LECs that appear in several 
physical quantities, as advocated in Ref. [46j. Such tasks are, however, beyond the scope of 
the present work. 

More generally, the techniques developed here can be extended to EFT expansions that 
depend on more than one variable and so contain non-analytic functions of the ratios of two 
different low-energy scales, e.g. particle energy E divided by m„-. This would permit the 
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extraction of EFT parameters from multi-energy analyses of experimental data (see, for ex- 
ample, Refs. [13] and [48[ for two such treatments using conventional statistical techniques). 
Such analyses of experimental data will, however, require treatment of the case in which 
observables have non-linear dependence on the underlying EFT parameters. 

Finally, we point out that Bayesian methods are also well-suited to making predictions 
in EFTs. Using similar steps to those that produced a formula for the marginalized pdf 
pi(a res \D, A), we can calculate the pdf, pi(M N (m)\D, A), that predicts M N {m) at some m 
where there is no extant lattice calculation given the existing data D. That pdf will be 
calculable as a sum/integral over the result found for M^m) at different orders, calculated 
with different higher-order coefficients (49|. Bayesian methods therefore yield predictions for 
physical quantities that have a well-defined uncertainty. That uncertainty incorporates both 
the uncertainty due to the input data, and the uncertainty due to higher-order effects in the 
EFT expansion. As such we expect it to grow as the EFT expansion becomes less accurate 
at higher m. The ability to provide such information on the reliability of a theoretical 
calculation is one of the great benefits of EFT, and it can be fully realized using Bayesian 
techniques. 
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APPENDIX A: DETERMINATION OF PDFS: VIA THE PRINCIPLE OF MAX- 
IMUM ENTROPY 



Priors incorporate previous knowledge of the hypothesis to be tested. As explained 
above for the example of naturalness, it is not always clear which functional form is best 
suited to incorporate this knowledge in a probability density. The method of maximum 
entropy 33) has been proposed as a way to obtain pdfs in cases where testable information 
is available. Maximum entropy is a variational method that gives the least biased pdf pr(x) 
by maximizing the entropy 



S 



I 



dxpi(x) log 



pr(x) 
m(x) 



(Al) 



under the constraints of the previously available information. Here, m(x) is a 'measure' that 
renders Eq. (1A1 j) invariant under a change of variables. The most basic testable information 
is the normalization of the pdf, 

dxpr(x) = 1. (A2) 
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Let us consider the case of the prior information of Eq. (11 ip . 



M 

vi=o 



(M + l)i? 2 . 



We then need to maximize the entropy Q 

"pr(a|M, i?) 







dapr(a|M, i?) log 



ma 



+ A 



1 - / dapr(a|M,i?) 



(M+l)R 2 - J daa 2 pr(a|M, i?) 



(A3) 



as a functional of pr(a|M, i?) and a function of the Lagrange multipliers Ao and Ax- Here 
we have indicated that the prior information holds at a fixed order M (i.e. a fixed number 
of dj's) and known R. Performing this maximization one finds a maximum for: 



pr(a|M, R) = m(a)e 



-(1-A ) -A ia 2 



Assuming a uniform measure m(a) = const then gives: 

A/+1 



pr(a|M, R) 



1 



e 2^ 



(A4) 



(A5) 



If both the mean (aj) = a^o and standard deviation a a . of each parameter is known a 
similar analysis leads to a product of Gaussians for the pdf 



pr(a|ao,0- a J 



,, v 2ira a . 

j =0 "j 



exp 



2a 2 . 



(A6) 



which is the standard maximum-entropy derivation of the x 2 distribution 12j. It should be 
noted that the pdf obtained from maximum entropy depends on the choice of the measure 
m(x) of Eq. (1A1[) being constant. A different choice of m(x) leads to different results for the 
pdf given the same testable information. 



APPENDIX B: MARGIN ALIZATION OVER HIGHER-ORDER PARAMETERS 
IN THE LINEAR CASE 

In Sec. Ill CI we showed how marginalization can be used to eliminate nuisance parame- 
ters from our considerations. For fixed order M and fixed R we obtained a posterior pdf 
pr(sL res \D, M, R). 10 According to Eqs. ()24j) and (1251) we can write this posterior in the form 

pr(a res \D,M,R) oc / da marg exp (~xlug) > 



Here we consider the general case a res = (ao, . . . , a r -i)- 
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where 



x 2 

A.aug 



& Aaug& 2b • a. -\- C 



for the linear case, and A aug , b and C are given in Eqs. fl2TT) . ffT9l) . and (120]) respectively. We 
now show that marginalization over SL marg does not change the result for the expectation 
value of a res and its variance. 

To perform the integration over the in Eq. ([21]) we rewrite b as b = (b res , b marg ) 

such that 



b • a — b res • a res + b 



'marg "-marg- 



Analogously we write the matrix A aug in block form 

-4, 



'■aug 



A 1 A 2 
A 3 Aj ■ 



(Bl) 



(B2) 



with Ai an rxr, A 2 an r x (M+l— r), A 3 an (M+l— r) xr, and A A an (M+l— r) x (M+l— r) 

\-aug 

,2 



matrix, respectively. We can then write Xa U g as 

Xaug &resA\&. res 2b res • a res -|- ^rnargA^clrnar-f] 2(\) marg a res ^43) ■ 2L m argi (B3) 



where we have used that A 2 = A3. The integration over SL marg is now straightforward and 
we obtain for the marginalized posterior pdf 



pr(a res |AM, J R)= M] 



1 



n 



1 



(2tt 



,M+1- 



x exp 



27TO"j 

2 \^res^^res 



2tt(t, 



Pi , 

2(3 ■ Si r 



det A 4 
C) 



where we have defined 



T = A 1 -A 2 (A 4 )~ 1 A 3 , 

(3 = h res — h-marg^A/^) 1 A 3 , 



(B4) 



(B5) 
(B6) 



and we have included the appropriate normalization factors. The marginalized posterior is 
again of Gaussian form and the value of a res that maximizes pr(& res \D, M, R) is given by 



,0 = r- 1 /?. 



(B7) 



For the considered case the marginalization procedure does not change the results for the 
coefficients 3Lr es . In the fit with the complete set of parameters the vector a that maximizes 
the probability is given by 

fB8l 



a = Augb- 



To determine the first r components of ao we need to calculate A a ^ g . With A aug in block 
form (see Eq. (1B2[) ) . the inverse can also be written in block form, 



A -i i (Ax- A 2 A^A 3 y l -A^A 2 (A 4 - A 3 A^A 2 ) 
' ""■* ~ y-A^A^A, - A 2 Al l A 3 )- 1 [A, - A 3 A^ 1 A 2 )~ 1 ! 



(B9) 
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where the upper left matrix is just the inverse of T defined in Eq. (1B5j) . The first r compo- 
nents of ao are then given by 



res 



r^b 



res 



A^ 1 A 2 (A i -A 3 A^A 2 )- 1 b 



marg j 



(BIO) 



while Eq. ( 1B7[) reads 



a, 



■resfl 



marg • 



(Bll) 



The two results agree if 



A^A 2 {A A - A 3 A^A 2 ) 



-1 



T~ x A 2 Al 



-1 



(B12) 



Multiplying with A4 — ^A] - A 2 from the right and V = A\ — A 2 A^ A3 from the left shows 
that this matrix equation indeed holds and therefore the result for the first r components of 
a remain invariant under marginalization, i.e. a 0)res = a res - Note also that the inverse of 
T is related to the covariance matrix for the marginalized case. From Eq. (IB9I) one sees that 
the elements (cov)^ with i,j<r of the full covariance matrix are the same as the elements 
of the covariance matrix in the marginalized case since (Ai — A 2 A^ 1 A 3 ) _1 = r -1 . 
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APPENDIX C: DATA FOR TOY MODEL APPLICATION 



- X 

2 X 


d(x) 


a 


0.05 


0.31694 


0.01585 


0.1 


0.33844 


0.01692 


0.15 


0.42142 


0.02107 


0.2 


0.57709 


0.02885 


0.25 


0.56218 


0.02811 


0.3 


0.68851 


0.03443 


0.35 


0.73625 


0.03681 


0.4 


0.87280 


0.04364 


0.45 


1.0015 


0.0501 


0.5 


1.0684 


0.0534 


VBLE XIV: Data set 


2 X 


d(x) 


a 


0.1 


0.37385 


0.01869 


0.2 


0.51985 


0.02599 


0.3 


0.68911 


0.03446 


0.4 


0.81065 


0.04053 


0.5 


1.0268 


0.0513 


0.6 


1.2747 


0.0637 


0.7 


1.8016 


0.0901 


0.8 


2.2042 


0.1102 


0.9 


2.7660 


0.1383 


1 


4.3970 


0.2198 



TABLE XV: Data set D 2 
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